ABSTRACT

While keyword queries empower ordinary users to search vast amounts
of data, their ambiguity makes them difficult to answer effectively,
especially when the queries are short and vague. To address this
challenging problem, we propose an approach that automatically
diversifies XML keyword search based on the different contexts of the
query keywords in the XML data. Given a short and vague keyword query
and the XML data to be searched, we first derive keyword search
candidates of the query using a classical feature selection model. We
then design an effective XML keyword search diversification model to
measure the quality of each candidate. After that, three efficient
algorithms are proposed to evaluate the generated query candidates
representing the diversified search intentions, from which we find
and return the top-k qualified query candidates that are most
relevant to the given keyword query while covering the maximal number
of distinct results. Finally, a comprehensive evaluation on real and
synthetic datasets demonstrates the effectiveness of our proposed
diversification model and the efficiency of our algorithms.

1. INTRODUCTION

Keyword search on structured and semi-structured data has attracted
much research interest recently, as it enables common users to
retrieve information from such data sources without the need to learn
sophisticated query languages and database structure [1]. In general,
the more keywords a given keyword query contains, the easier it is to
identify its search semantics. However, when the query contains only
a small number of vague keywords, deriving its search semantics
becomes very challenging because of the high ambiguity of this type
of query. Although user involvement is sometimes helpful for
identifying the search semantics of keyword queries, relying on users
is not always possible because keyword queries may also come from
system applications. In this case, a web or database search engine
may need to automatically compute the search semantics of short and
frequent keyword queries based only on the data to be searched. The
derived search semantics are maintained and updated offline. Once a
keyword query is issued by real users, its corresponding search
semantics can be used directly to give an instant response. In this
paper, we focus on the problem of effectively deriving the search
semantics of keyword queries by considering the data only, which has
not received close attention in previous work.

Example 1. Consider a simple keyword query q = {database, query}
over the DBLP dataset. There are 21,260 publications containing the
keyword "database" and 9,896 publications containing the keyword
"query", which contribute 2,040 results that contain the two given
keywords together. Directly exploring and understanding the keyword
search results would be time consuming and not user-friendly due to
the huge number of results. It takes 54.22 seconds just to compute
all the SLCA results of q by using XRank [2]. Even if the system
processing time can be made acceptable by accelerating keyword query
evaluation with efficient algorithms [3, 4], the unclear and repeated
search intentions in the large set of retrieved results will
frustrate users. To address the problem, we derive different search
semantics of the original query from the different contexts of the
XML data to be searched, which can be used to represent the different
search intentions of the original query. In this work, the contexts
are modeled by extracting relevant feature terms of the query
keywords from the XML data, as shown in Table 1. We then compute the
keyword search results for each search intention. Table 2 shows part
of the statistics of the answers related to the keyword query q,
which classify the ambiguous keyword query into different search
intentions.

By exploring the different feature terms of the query keywords, we
gain two benefits: the first is to diversify the keyword search
results automatically by the different search intentions, which
returns more distinct and diversified results to users; the second is
to improve the efficiency of keyword search, because the contexts of
diversified keyword queries can be used to reduce the size of the
relevant keyword node lists.

Table 1: Top 10 selected feature terms of q

keyword    features
database   systems; relational; protein; distributed; oriented;
           image; sequence; search; model; large.
query      language; expansion; optimization; evaluation; complexity;
           log; efficient; distributed; semantic; translation.

Table 2: Part of statistic information for q

database systems query +
  feature terms: language, expansion, optimization, evaluation, complexity
  #results:      71, 5, 68, 13, 1
  feature terms: log, efficient, distributed, semantic, translation
  #results:      12, 17, 50, 14, 8

relational database query +
  feature terms: language, expansion, optimization, evaluation, complexity
  #results:      40, 20, 8
  feature terms: log, efficient, distributed, semantic, translation
  #results:      2, 11, 5, 7, 5

Therefore, we are motivated to study the problem of keyword search
diversification based on the contexts of the query keywords in the
XML data to be searched, which we denote as intent-based
diversification. Although intent-based diversification has been
discussed in information retrieval (IR), e.g., [5] models user
intents at the topical level of a taxonomy and [6] obtains the
possible query intents by mining query logs, these approaches are not
always applicable: on the one hand, it is not easy to obtain a useful
taxonomy or query logs; on the other hand, diversified results are
modelled at different levels, i.e., documents in IR vs. fragments in
XML. To the best of our knowledge, [7] is the most relevant work; it
first maps each keyword to a set of attribute-keyword pairs and then
constructs a set of structured queries. It assumes that each
structured query represents a query interpretation. However, this
assumption is too strict for XML data because contextual information
is not necessarily structured, i.e., it may appear in the form of
either attribute labels or texts.

The problem of diversifying keyword search was first proposed and
studied in the IR community [8][5][9][10][11]. Most of the techniques
perform diversification as a post-processing or re-ranking step of
document retrieval, based on the analysis of the result set and/or
the historic query logs. In IR, keyword search diversification is
designed at the topic or document level. For structured or
semi-structured databases, it needs to be redesigned at the tuple or
fragment level. To address this difference, the authors in [12]
propose to navigate SQL results through categorization, which takes
user preferences into account. It consists of two steps: the first
step analyzes the query history of all users in the system offline
and generates a set of clusters over the data, each corresponding to
one type of user preference; for an issued query, the second step
presents to the user a navigational tree over the clusters generated
in the first step. In this way, the user can browse, rank, or
categorize the results in selected clusters. The authors in [13]
introduce a pre-indexing approach for efficient diversification of
query results on relational databases, based on prespecified
diversity orderings among the attributes over relations. The authors
in [14] first work out a small number of tuples by choosing one
representative from each cluster and return them on the first page,
which helps users learn what is available in the whole result set and
directs them to what they need. The authors in [15] differentiate the
keyword search results by comparing their feature sets, where the
feature types are limited to the labels of XML elements in the
keyword search results. All of these methods can be classified as
post-process search result analysis. They encounter two challenging
problems: the first is effectiveness, because comparing results
becomes difficult when the content of a result is not very
informative; the second is efficiency, because they have to compute
all the results and then analyse and compare them one by one.

To address the above limitations, we initiate a formal study of the
diversification problem in XML keyword search, which can directly
compute the diversified results without retrieving all the relevant
candidates. Towards this goal, given a keyword query, we first derive
the co-related feature terms for each query keyword from the XML data
based on mutual information from probability theory, which has been
used as a criterion for feature selection [16][17]. The selection of
our feature terms is not limited to the labels of XML elements. Each
combination of the feature terms and the original query keywords
represents one of the diversified contexts that express specific
search intentions. We then evaluate the derived search intentions by
considering their relevance to the original keyword query and the
novelty of the produced results. To compute diversified keyword
search efficiently, we propose one baseline algorithm and two
efficient algorithms based on the observed properties of diversified
keyword search results.

The remainder of this paper is organized as follows. In Section 2,
we introduce a feature selection model and define the problem of
diversifying XML keyword search. We describe the procedure of
extracting the relevant feature terms for a keyword query based on
the explored feature selection model in Section 3. In Section 4, we
first show the procedure of generating search intentions from the
derived feature terms and then propose three efficient algorithms,
based on the observed properties of XML keyword search results, to
identify a set of qualified and diversified keyword queries and
compute their corresponding results. In Section 5, we provide
extensive experimental results to show the effectiveness of our XML
keyword search diversification model and the performance of our
proposed algorithms. We describe the related work in Section 6 and
conclude in Section 7.

2. PROBLEM DEFINITION 

Given a keyword query q and XML data denoted by T, we consider a set
of possible search intentions Q that are generated by binding each
query keyword to a context using its relevant feature terms in T.
Here, search intentions are also represented in the format of keyword
queries. Naturally, we need to present to the users the top-k
qualified queries in terms of high relevance and maximal
diversification.

2.1 Feature Selection Model 

Consider an XML data T and a set of term-pairs W that can 
appear in T. The composition method of W depends on the appli- 
cation context and will not affect our subsequent discussion. As an 
example, it can simply be the full or a subset of the terms compris- 
ing the text in T, the contents of a dictionary, or a well-specified 
set of term-pairs relevant to some applications. 

In this work, the distinct term-pairs are selected based on their
mutual information, as in [16][17]. Mutual information has been used
as a criterion for feature selection and feature transformations in
machine learning. It can be used to characterize both the relevance
and the redundancy of variables, as in minimum redundancy feature
selection. Assume we have an XML tree T and its sample result set
R(T). Let Prob(x, T) be the probability of term x appearing in R(T),
i.e., $Prob(x, T) = |R(x, T)| / |R(T)|$, where |R(x, T)| is the
number of results containing x. Let Prob(x, y, T) be the probability
of terms x and y co-occurring in R(T), i.e.,
$Prob(x, y, T) = |R(x, y, T)| / |R(T)|$. If terms x and y are
independent, then knowing x does not give any information about y and
vice versa, so their mutual information is zero. At the other
extreme, if terms x and y are identical, then knowing x determines
the value of y and vice versa. Therefore, this simple measure can be
used to quantify how strongly the observed terms co-occur, which lets
us maximize the dependency between a keyword and its feature terms
while reducing the redundancy among the feature terms. In this work,
we use the widely accepted mutual information model as follows.



$$MI(x, y, T) = Prob(x, y, T) \cdot \log \frac{Prob(x, y, T)}{Prob(x, T) \cdot Prob(y, T)} \tag{1}$$

For each term in the XML data, we need to find a set of feature
terms. The feature terms can be selected in any way, e.g., the top-m
feature terms, or the feature terms whose mutual information scores
are higher than a threshold given by the domain application or the
data administrator. The feature terms can be pre-computed and stored
before query evaluation. Thus, given a keyword query, we can obtain a
matrix of features for the query keywords against the XML data to be
searched. The matrix constructs a space of search intentions of the
original query w.r.t. the XML data. Therefore, our first problem is
to find a set of feature terms from the matrix that has the highest
probability of interpreting the contexts of the original query. In
this work, we extract and evaluate the feature terms at the entity
level of the XML data.
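
As a minimal sketch of how the mutual information score and the top-m selection could be computed from pre-collected counts, consider the following. The function names, the count dictionaries, and the use of the natural logarithm are illustrative assumptions, not part of the proposed system.

```python
import math


def mutual_information(n_xy, n_x, n_y, n_total):
    """Mutual information score from result counts.

    n_xy: number of results containing both x and y, i.e. |R(x, y, T)|
    n_x, n_y: number of results containing x (resp. y)
    n_total: size of the sample result set, i.e. |R(T)|
    The logarithm base is an assumption (natural log is used here).
    """
    if n_xy == 0:
        return 0.0  # terms that never co-occur contribute nothing
    p_xy = n_xy / n_total
    p_x = n_x / n_total
    p_y = n_y / n_total
    return p_xy * math.log(p_xy / (p_x * p_y))


def top_m_features(keyword, term_counts, pair_counts, n_total, m):
    """Return the m terms with the highest score w.r.t. `keyword`.

    term_counts: hypothetical dict term -> number of results containing it
    pair_counts: hypothetical dict (x, y) -> number of co-occurrence results
    """
    scores = {}
    for (x, y), n_xy in pair_counts.items():
        if x == y or keyword not in (x, y):
            continue
        other = y if x == keyword else x
        scores[other] = mutual_information(
            n_xy, term_counts[keyword], term_counts[other], n_total
        )
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return ranked[:m]


# Toy counts (made up for illustration only).
toy_terms = {"database": 40, "systems": 20, "protein": 10}
toy_pairs = {("database", "systems"): 15, ("database", "protein"): 2}
print(top_m_features("database", toy_terms, toy_pairs, 100, 1))
```

A threshold-based selection, the other option mentioned above, would only change the last step: filter `scores` by a minimum value instead of slicing the ranked list.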

Table 3: Mutual information score w.r.t. terms in q

database              system    relational  protein       distributed  oriented
Mutual score (10^-?)  7.06      3.84        2.79          2.25         2.06
                      image     sequence    search        model        large
Mutual score          1.73      1.31        1.1           1.04         1.02

query                 language  expansion   optimization  evaluation   complexity
Mutual score          3.63      2.97        2.3           1.71         1.41
                      log       efficient   distributed   semantic     translation
Mutual score          1.17      1.03        0.99          0.86         0.70

Consider the query q = {database, query} over the DBLP XML dataset
again. Its corresponding matrix can be constructed from Table 1.
Table 3 shows the mutual information scores for the query keywords in
q. Each combination of the feature terms in the matrix represents a
search intention with specific semantics. For example, the
combination "query expansion database systems" targets publications
discussing the problem of query expansion in the area of database
systems, e.g., the work "query expansion for information retrieval",
published in the Encyclopedia of Database Systems in 2009, will be
returned. If we replace the feature term "systems" with "relational",
the generated query searches for publications on query expansion over
relational databases, for which the returned results are empty
because no work on this problem over relational databases is reported
in the DBLP dataset.

2.2 Keyword Search Diversification Model 

In this model, we consider not only the probability of the newly
generated queries, i.e., relevance, but also their new and distinct
results, i.e., novelty. To embody the relevance and novelty of
keyword search together, two criteria should be satisfied: (1) the
generated query $q_{new}$ has the maximal probability of interpreting
the contexts of the original query q with regard to the data to be
searched; and (2) the generated query $q_{new}$ has the maximal
difference from the previously generated query set Q. Therefore, we
have the aggregated scoring function

$$score(q_{new}) = Prob(q_{new} \mid q, T) \cdot DIF(q_{new}, Q, T) \tag{2}$$

where $Prob(q_{new} \mid q, T)$ represents the probability that
$q_{new}$ is the search intention when the original query q is issued
over the data T, and $DIF(q_{new}, Q, T)$ represents the percentage
of results that are produced by $q_{new}$ but not by any previously
generated query in Q.

First, we show how to calculate the probability
$Prob(q_{new} \mid q, T)$ that a query $q_{new}$ is intended when the
user issues q on the XML data T. Based on Bayes' theorem, we have

$$Prob(q_{new} \mid q, T) = \frac{Prob(q \mid q_{new}, T) \cdot Prob(q_{new} \mid T)}{Prob(q \mid T)} \tag{3}$$

where $Prob(q \mid q_{new}, T)$ models the likelihood of generating
the observed query q while the intended query is actually $q_{new}$,
and $Prob(q_{new} \mid T)$ is the query generation probability given
the XML data T.

The likelihood $Prob(q \mid q_{new}, T)$ can be measured by computing
the probability that the original query q is observed in the context
of the features in $q_{new}$. Consider a query $q = \{k_i\}$ and a
generated new query $q_{new} = \{s_i\}$, where $k_i$ is a query
keyword in q and $s_i$ is a segment that consists of the query
keyword $k_i$ and one of its features $f_{i j_i} \in P$, with
$1 \le i \le n$ and $1 \le j_i \le m$. Here, we assume that for each
query keyword, only the top m most relevant features are retrieved
from P to generate new queries. To deal with multi-keyword queries,
we make an independence assumption on the probability of generating a
query keyword $k_i$ while the intended feature is actually
$f_{i j_i}$. That is,

$$Prob(q \mid q_{new}, T) = \prod_{k_i \in q,\; f_{i j_i} \in q_{new}} Prob(k_i \mid f_{i j_i}, T) \tag{4}$$

According to the statistical information, the intent of a keyword
can be inferred from the occurrences of the keyword and its
correlated terms in the data to be searched. Thus, we can compute the
probability $Prob(k_i \mid f_{i j_i}, T)$ of interpreting a keyword
$k_i$ into a search intent $f_{i j_i}$ as follows.

$$Prob(k_i \mid f_{i j_i}, T) = \frac{Prob(f_{i j_i} \mid k_i, T) \cdot Prob(k_i, T)}{Prob(f_{i j_i}, T)} = \frac{|R(\{k_i, f_{i j_i}\}, T)| / |R(T)|}{|R(f_{i j_i}, T)| / |R(T)|} = \frac{|R(s_i, T)|}{|R(f_{i j_i}, T)|} \tag{5}$$

where $s_i = \{k_i, f_{i j_i}\}$.
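
The following sketch illustrates how Equations 4 and 5 reduce to a product of count ratios. The function names and the counts in the usage lines are hypothetical; in practice the counts would come from the sizes of the precomputed node lists.

```python
def keyword_given_feature(n_segment, n_feature):
    """Equation 5: |R(s_i, T)| / |R(f, T)|.

    n_segment: number of results for the segment {keyword, feature}
    n_feature: number of results for the feature term alone
    """
    if n_feature == 0:
        return 0.0  # a feature that never occurs cannot explain the keyword
    return n_segment / n_feature


def likelihood(segment_counts):
    """Equation 4: product of Equation 5 over all segments of q_new.

    segment_counts: list of (n_segment, n_feature) pairs, one per keyword.
    """
    prob = 1.0
    for n_segment, n_feature in segment_counts:
        prob *= keyword_given_feature(n_segment, n_feature)
    return prob


# Made-up counts for two segments of a generated query.
print(likelihood([(30, 120), (5, 50)]))
```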

Consider a query q = {database, query} and one of its new queries
$q_{new}$ = {database system; query expansion}.
$Prob(q \mid q_{new}, T)$ gives the probability that a publication
addresses the problem of "database query" in the context of "system"
and "expansion", which can be computed by

$$\frac{|R(\{\text{database system}\}, T)|}{|R(\text{system}, T)|} \cdot \frac{|R(\{\text{query expansion}\}, T)|}{|R(\text{expansion}, T)|}$$

Here, |R({database system}, T)| represents the number of keyword
search results of the query {database system} over the data T. The
result type can be defined by users. In this work, we adopt the
widely accepted semantics, Smallest Lowest Common Ancestor (SLCA), to
model XML keyword search results. Briefly, a node v is regarded as an
SLCA for a keyword query if (a) the subtree $T_{sub}(v)$ rooted at
the node v contains all the query keywords; and (b) there does not
exist a descendant node v' of v such that $T_{sub}(v')$ contains all
the query keywords. |R(system, T)| represents the number of keyword
search results of running the query "system" over the data T; this
number can be obtained without running the query because it is equal
to the size of the keyword node list of "system" over T.
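
To make the SLCA semantics concrete, the sketch below checks conditions (a) and (b) by brute force over Dewey-coded nodes represented as Python tuples. It is only an illustration of the definition under that encoding assumption; it is not one of the efficient SLCA algorithms cited in this paper, and the function names are hypothetical.

```python
def is_ancestor_or_self(a, b):
    """True if Dewey code `a` is a prefix of (or equal to) Dewey code `b`."""
    return len(a) <= len(b) and b[:len(a)] == a


def slca(node_lists):
    """Brute-force SLCA over keyword node lists of Dewey codes (tuples)."""
    if not node_lists or any(not nodes for nodes in node_lists):
        return []
    # Every ancestor-or-self of a keyword node is a candidate.
    candidates = {
        node[:depth]
        for nodes in node_lists
        for node in nodes
        for depth in range(1, len(node) + 1)
    }
    # Condition (a): the subtree of the candidate contains all keywords.
    covering = [
        c for c in candidates
        if all(any(is_ancestor_or_self(c, n) for n in nodes)
               for nodes in node_lists)
    ]
    # Condition (b): no proper descendant also covers all keywords.
    smallest = [
        c for c in covering
        if not any(c != d and is_ancestor_or_self(c, d) for d in covering)
    ]
    return sorted(smallest)


# Two keyword node lists in a small hypothetical tree rooted at (1,).
list_a = [(1, 1, 1), (1, 2, 1)]
list_b = [(1, 1, 2), (1, 2, 3, 1)]
print(slca([list_a, list_b]))
```

The brute-force check is quadratic in the number of candidates, which is acceptable for illustrating the semantics but not for large node lists.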



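Given the XML data T, the query generation probability of $q_{new}$
can be calculated by

$$Prob(q_{new} \mid T) = \frac{|R(q_{new}, T)|}{|R(T)|} = \frac{\left| \bigcap_{s_i \in q_{new}} R(s_i, T) \right|}{|R(T)|} \tag{6}$$

where $\bigcap_{s_i \in q_{new}} R(s_i, T)$ represents the set of
SLCA results obtained by merging the node lists $R(s_i, T)$ for
$s_i \in q_{new}$ using the algorithms in [3], [4].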

Given an original query and the data, the value $1 / Prob(q \mid T)$
is relatively unchanged across the different generated queries.
Therefore, the above equation can be rewritten as follows.

$$Prob(q_{new} \mid q, T) = \gamma \cdot \left( \prod \frac{|R(s_i, T)|}{|R(f_{i j_i}, T)|} \right) \cdot \frac{\left| \bigcap R(s_i, T) \right|}{|R(T)|} \tag{7}$$

where $k_i \in q$, $s_i \in q_{new}$, $f_{i j_i} \in s_i$, and
$\gamma = 1 / Prob(q \mid T)$ can be set to any value in (0, 1]
because it does not affect the refined query selection w.r.t. an
original keyword query q and data T.

Although the above equation can model the probability of a possible
intended query, i.e., the relevance between the generated queries and
the original query w.r.t. the data, different generated queries may
return overlapping result sets. Therefore, we should also take into
account the novelty of the results produced by the generated queries.

We now show how to quantify the value $DIF(q_{new}, Q, T)$. A single
keyword query can be answered based on the normal SLCA semantics.
However, the normal SLCA semantics is not enough to explain the
results in this paper because, by the SLCA result of a query q, we
mean the SLCA results of the set of all generated queries of q. Let v
be an SLCA node returned from a previously generated query $q_{pre}$
of q, and v' be an SLCA node returned from the newly generated query
$q_{new}$ of q. Similar to the normal SLCA semantics, if v' is an
ancestor node of v or the same as v, v' cannot become a new SLCA node
of q. However, different from the normal SLCA semantics, if v' is a
descendant node of v, v' will replace v as a new SLCA node of q
because v' presents more specific semantics than v does. As such, the
novelty $DIF(q_{new}, Q, T)$ of a newly generated query $q_{new}$
against the previously generated query set Q can be calculated as
follows.

$$DIF(q_{new}, Q, T) = \frac{\left| \{ v_x \mid v_x \in R(q_{new}, T) \wedge \nexists v_y \in \{\bigcup_{q' \in Q} R(q', T)\} \wedge v_x \le v_y \} \right|}{\left| R(q_{new}, T) \cup \{\bigcup_{q' \in Q} R(q', T)\} \right|} \tag{8}$$

where $v_x \le v_y$ means that $v_x$ is a duplicate of $v_y$, i.e.,
"=", or $v_x$ is an ancestor of $v_y$, i.e., "<";
$\bigcup_{q' \in Q} R(q', T)$ represents the set of SLCA results
generated by the queries in Q, from which duplicate SLCA nodes and
ancestor SLCA nodes have to be cleaned up.

Our problem is to find the top-k qualified queries and their relevant
SLCA results. To do this, we can compute the absolute score of the
search intention for each generated query. To reduce the
computational cost, an alternative is to calculate the relative
scores of the queries. Therefore, we have the following equation
transformation. After we substitute Equation 7 and Equation 8 into
Equation 2, we have the final equation.

$$\begin{aligned}
score(q_{new}) &= \gamma \cdot \left( \prod \frac{|R(s_i, T)|}{|R(f_{i j_i}, T)|} \right) \cdot \frac{\left| \bigcap R(s_i, T) \right|}{|R(T)|} \cdot \frac{\left| \{ v_x \mid v_x \in R(q_{new}, T) \wedge \nexists v_y \in \{\bigcup_{q' \in Q} R(q', T)\} \wedge v_x \le v_y \} \right|}{\left| R(q_{new}, T) \cup \{\bigcup_{q' \in Q} R(q', T)\} \right|} \\
&= \frac{\gamma}{|R(T)|} \cdot \left( \prod \frac{|R(s_i, T)|}{|R(f_{i j_i}, T)|} \right) \cdot \left| \bigcap R(s_i, T) \right| \cdot \frac{\left| \{ v_x \mid v_x \in R(q_{new}, T) \wedge \nexists v_y \in \{\bigcup_{q' \in Q} R(q', T)\} \wedge v_x \le v_y \} \right|}{\left| R(q_{new}, T) \cup \{\bigcup_{q' \in Q} R(q', T)\} \right|} \\
&\mapsto \left( \prod \frac{|R(s_i, T)|}{|R(f_{i j_i}, T)|} \right) \cdot \left| \bigcap R(s_i, T) \right| \cdot \frac{\left| \{ v_x \mid v_x \in R(q_{new}, T) \wedge \nexists v_y \in \{\bigcup_{q' \in Q} R(q', T)\} \wedge v_x \le v_y \} \right|}{\left| R(q_{new}, T) \cup \{\bigcup_{q' \in Q} R(q', T)\} \right|}
\end{aligned} \tag{9}$$

where $k_i \in q$, $s_i \in q_{new}$, $f_{i j_i} \in s_i$,
$q' \in Q$, and the symbol $\mapsto$ indicates that the left side of
the equation depends on the right side, because the value
$\gamma / |R(T)|$ stays unchanged when calculating the
diversification scores of different search intentions.
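
As a rough illustration of the novelty measure and the relative score, the sketch below computes both from lists of SLCA results encoded as Dewey tuples. It assumes that the previous result set has already been cleaned of duplicates and ancestors and that the denominator is a plain set union; the function names and the sample data are hypothetical.

```python
def is_ancestor_or_self(a, b):
    """True if Dewey code `a` is a prefix of (or equal to) Dewey code `b`."""
    return len(a) <= len(b) and b[:len(a)] == a


def dif(new_results, previous_results):
    """Novelty of q_new: share of results that are neither duplicates
    nor ancestors of previously returned SLCA results."""
    new_set, prev_set = set(new_results), set(previous_results)
    novel = [
        v for v in new_set
        if not any(is_ancestor_or_self(v, p) for p in prev_set)
    ]
    union = new_set | prev_set
    return len(novel) / len(union) if union else 0.0


def relative_score(likelihood, new_results, previous_results):
    """Relative score: likelihood * |R(q_new, T)| * DIF.

    The constant factor gamma / |R(T)| is dropped because it is the same
    for every generated query.
    """
    return likelihood * len(set(new_results)) * dif(new_results, previous_results)


new = [(1, 1), (1, 2, 3), (1, 4)]
prev = [(1, 1), (1, 2, 3, 1), (1, 5)]
print(dif(new, prev), relative_score(0.025, new, prev))
```

In the sample data, (1, 1) is a duplicate and (1, 2, 3) is an ancestor of a previous result, so only (1, 4) counts as novel.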



3. EXTRACTING FEATURE TERMS 

Although we can pre-compute and manipulate co-related terms of any
size, two-term co-occurrences are the most reasonable choice in most
applications [18]. In addition, two-term co-occurrences can be
computed and stored efficiently, as described in [19]. Co-occurrences
of higher order can be utilized at the expense of space and, most
importantly, time. For the scale of the applications we envision,
materializing co-occurrences of length higher than two is probably
infeasible. Therefore, in this work, we materialize two-term
co-occurrences, which involves computing a sorted list of triplets
(x, y, R(x, y)). Every such triplet contains the set of IDs of the
results for terms x and y. The triplets are sorted by result size. If
two terms do not co-occur, we simply do not store the corresponding
triplet.

To infer the feature terms for a keyword, we first take the entity
nodes (e.g., the nodes with the node types in the XML DTD) as a
sample space. The inferred terms represent different feature
interpretations of the keyword in the XML data to be searched. Every
possible keyword has a list of entity IDs, where each entity ID is
encoded using the Dewey coding scheme. Given a keyword x and a term
y, their mutual information score can be calculated based on
Equation 1, where Prob(x, T) (or Prob(y, T)) is the list size of x
(or y) divided by the total entity size of the sample space, and
Prob({x, y}, T) is the merged result size of the lists of x and y
divided by the total entity size of the sample space. Similarly, we
can compute the mutual information scores of x with the other terms.
After that, the terms with the top-m mutual information scores are
selected as the m distinct features of the keyword x in the sample
space.

To compute the mutual information scores efficiently, we only need
to traverse the XML data tree once. During the traversal, we first
extract the meaningful text information from each entity we meet.
Here, we filter out the stop words, which cannot contribute to
extracting meaningful contexts of search intentions. We then produce
a set of term-pairs by scanning the extracted text. To maximize the
relevance and reduce the redundancy, we require that two terms that
co-occur in the same term-pair be at most three steps apart in the
text. The maximal number of steps can be varied based on
requirements. After that, all the reasonable term-pairs are generated
and recorded. When the next entity comes, we process its text in the
same way and generate meaningful term-pairs. If a term-pair has been
recorded beforehand, we just increase the count of the term-pair.
Otherwise, a new term-pair is created and recorded. At the same time,
the total number of entities is recorded, too. After the XML data
tree has been traversed completely, we can compute the mutual
information score for each term-pair based on Equation 1.
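
The single-pass counting described above can be sketched as follows, with each entity reduced to its text. The stop-word list, the whitespace tokenization, and the choice to count a term-pair at most once per entity are simplifying assumptions made only for this illustration.

```python
from collections import Counter

# Hypothetical, deliberately tiny stop-word list.
STOP_WORDS = {"a", "an", "the", "of", "for", "in", "on", "and", "to"}


def term_pairs(text, max_steps=3):
    """Return the terms of one entity and its term-pairs.

    Two terms form a pair if they are at most `max_steps` positions
    apart after stop-word removal.
    """
    terms = [t for t in text.lower().split() if t not in STOP_WORDS]
    pairs = set()
    for i, x in enumerate(terms):
        for y in terms[i + 1:i + 1 + max_steps]:
            if x != y:
                pairs.add(tuple(sorted((x, y))))  # unordered pair
    return set(terms), pairs


def count_cooccurrences(entity_texts, max_steps=3):
    """One pass over all entities: term counts, pair counts, entity total."""
    term_counts, pair_counts = Counter(), Counter()
    n_entities = 0
    for text in entity_texts:
        terms, pairs = term_pairs(text, max_steps)
        term_counts.update(terms)   # each term counted once per entity
        pair_counts.update(pairs)   # each pair counted once per entity
        n_entities += 1
    return term_counts, pair_counts, n_entities


texts = [
    "query expansion for information retrieval",
    "efficient query optimization in relational database systems",
]
terms, pairs, total = count_cooccurrences(texts)
print(terms["query"], pairs[("expansion", "query")], total)
```

The resulting counts are exactly the inputs needed to evaluate the mutual information score of each recorded term-pair.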

4. KEYWORD SEARCH DIVERSIFICATION ALGORITHMS

In this section, we first introduce the procedure of generating a 
new query from the matrix of the original keyword query w.r.t. the 
data to be searched. And then based on the matrix, we propose 
a baseline algorithm to retrieve the diversified keyword search re- 
sults. At last, two anchor-based pruning algorithms are designed 
to improve the efficiency of the keyword search diversification by 
utilizing the intermediate results. 

4.1 Generate Search Intentions 

Given a keyword query q, we first retrieve the corresponding feature
terms for each query keyword and then construct a matrix of search
intentions. In the matrix, the feature terms in each column are
sorted based on their mutual information scores. Each combination of
the feature terms (one term per column) represents a search
intention. We iteratively choose the combination with the maximal
aggregated mutual information score as the next best search intention
until the termination requirements are reached.

As we discussed above, the aggregated mutual information score 
of each search intention represents to some extent the confidence 
of the context of the query keywords. Without other knowledge, 
we would like to generate the search intentions and then check the 
corresponding queries in descending order by their aggregated mu- 
tual information scores. In this work, we select 20 feature terms 
for each query keyword and then generate all the possible search 
intentions, from which we further identify the top k qualified and 
diversified queries w.r.t. the original query. 
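
One way to enumerate search intentions in descending order of aggregated score, without materializing every combination up front, is a best-first traversal of the matrix. The sketch below assumes that the aggregated score is the sum of the per-column mutual information scores, which the text does not specify; the function name and the matrix layout are likewise hypothetical.

```python
import heapq


def generate_intentions(matrix):
    """Yield (aggregated_score, segments) in non-increasing score order.

    matrix: list of (keyword, features), one entry per query keyword,
    where features is a list of (feature_term, score) sorted by score
    in descending order.
    """
    columns = [features for _, features in matrix]
    if not columns or any(not column for column in columns):
        return

    def total(indices):
        return sum(column[i][1] for column, i in zip(columns, indices))

    start = (0,) * len(columns)
    heap = [(-total(start), start)]  # max-heap via negated scores
    seen = {start}
    while heap:
        negated, indices = heapq.heappop(heap)
        segments = [
            (keyword, features[i][0])
            for (keyword, features), i in zip(matrix, indices)
        ]
        yield -negated, segments
        # Advance one column at a time to reach the next-best candidates.
        for pos in range(len(indices)):
            if indices[pos] + 1 < len(columns[pos]):
                nxt = indices[:pos] + (indices[pos] + 1,) + indices[pos + 1:]
                if nxt not in seen:
                    seen.add(nxt)
                    heapq.heappush(heap, (-total(nxt), nxt))


# A 2x2 excerpt of the scores in Table 3.
matrix = [
    ("database", [("system", 7.06), ("relational", 3.84)]),
    ("query", [("language", 3.63), ("expansion", 2.97)]),
]
for score, segments in generate_intentions(matrix):
    print(round(score, 2), segments)
```

Because each column is sorted in descending order, moving one index forward never increases the sum, so the heap always holds the best unvisited combination.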

4.2 Baseline Solution 

Given a keyword query, the intuitive idea of the baseline algorithm
is as follows: we first retrieve the pre-computed feature terms of
the given keyword query from the XML data T; we then generate all the
possible intended queries based on the retrieved feature terms;
finally, we compute the SLCAs as keyword search results for each
query and measure its diversification score. As such, the top-k
diversified queries and their corresponding results can be returned
to users.

Different from traditional XML keyword search, we have to de- 
tect and remove the duplicated or ancestor results by comparing 
the new generated results with the previously generated ones. This 
is because a result may cover multiple search intentions. To meet 
the requirement of keyword search diversification, however, we are 
required to return the distinct SLCA results to the users. 

The detailed procedure is shown in Algorithm 1. Given a keyword
query q with n keywords, we first load its pre-computed relevant
feature terms from the XML data T, which are used to construct a
matrix $M_{m \times n}$, as shown in Line 1. We then generate a new
query $q_{new}$ from the matrix $M_{m \times n}$ by calling the
function GenerateNewQuery(), as shown in Line 2. New queries are
generated in descending order of their mutual information scores.
Line 3 to Line 7 show the procedure of computing
$Prob(q \mid q_{new}, T)$. To compute the SLCA results of $q_{new}$,
we need to retrieve the precomputed node lists of the keyword-feature
term pairs in $q_{new}$ from T by getNodeList($s_{i_x j_y}$, T).
Based on the retrieved node lists, we can compute the likelihood of
generating the observed query q while the intended query is actually
$q_{new}$, i.e.,
$Prob(q \mid q_{new}, T) = \prod_{f_{i_x j_y} \in s_{i_x j_y} \in q_{new}} \left( |l_{i_x j_y}| / getNodeSize(f_{i_x j_y}, T) \right)$
using Equation 5, where getNodeSize($f_{i_x j_y}$, T) can be quickly
obtained from the precomputed statistics of T. After that, we call
the function ComputeSLCA(), which can be implemented using any
existing XML keyword search method. In Line 8 to Line 16, we compare
the SLCA results of the current query with those of the previous
queries in order to obtain the distinct and diversified SLCA results.
At Line 17, we compute the final score of $q_{new}$ as a diversified
query w.r.t. the previously generated queries in Q. Finally, we
compare the new query with the previously generated queries and
replace the unqualified ones in Q, as shown in Line 18 to Line 23.
After processing all the possible queries, we return the top-k
generated queries with their SLCA results.

Algorithm 1 Baseline Algorithm

input: a query q with n keywords and XML data T
output: top-k search intentions Q and overall result set Φ

 1: M_{m×n} = getFeatureTerms(q, T);
 2: while (q_new = GenerateNewQuery(M_{m×n})) ≠ null do
 3:   φ = null and prob_s_k = 1;
 4:   l_{ix jy} = getNodeList(s_{ix jy}, T) for s_{ix jy} ∈ q_new ∧ 1 ≤ ix ≤ m ∧ 1 ≤ jy ≤ n;
 5:   prob_s_k = ∏_{f_{ix jy} ∈ s_{ix jy} ∈ q_new} ( |l_{ix jy}| / getNodeSize(f_{ix jy}, T) );
 6:   φ = ComputeSLCA({l_{ix jy}});
 7:   prob_q_new = prob_s_k * |φ|;
 8:   if Φ is empty then
 9:     score(q_new) = prob_q_new;
10:   else
11:     for all result candidates r_x ∈ φ do
12:       for all result candidates r_y ∈ Φ do
13:         if r_x == r_y or r_x is an ancestor of r_y then
14:           φ.remove(r_x);
15:         else if r_x is a descendant of r_y then
16:           Φ.remove(r_y);
17:     score(q_new) = prob_q_new * |φ| / |φ ∪ Φ|;
18:   if |Q| < k then
19:     put q_new : score(q_new) into Q;
20:     put q_new : φ into Φ;
21:   else if score(q_new) > score({q'_new ∈ Q}) then
22:     replace q'_new : score(q'_new) with q_new : score(q_new);
23:     Φ.remove(q'_new);
24: return Q and result set Φ;

In the worst case, all the possible queries in the matrix may be
chosen as top-k qualified query candidates. In this case, the
complexity of the algorithm is
$O(m^{|q|} \cdot L_1 \sum_{i=2}^{|q|} \log L_i)$, where $L_1$ is the
shortest node list of any generated query, |q| is the number of
original query keywords, and m is the number of selected features for
each query keyword. In practice, the complexity of the algorithm can
be reduced by reducing the number m of feature terms, which bounds
the number $m^{|q|}$ of generated queries.

4.3 Anchor-based Pruning Solution 

By analysing the baseline solution, we find that its main cost is
spent on computing SLCA results and removing unqualified SLCA results
from the newly and previously generated result sets. To reduce the
computational cost, we design an anchor-based pruning solution, which
avoids the unnecessary computation of unqualified SLCA results (i.e.,
duplicates and ancestors). In this subsection, we first analyze the
interrelationships between the intermediate SLCA candidates that have
already been computed for the generated queries Q and the nodes that
will be merged for answering the newly generated query $q_{new}$. We
then present the detailed description and algorithm of the
anchor-based pruning solution.

4.3.1 Properties of Computing Diversified SLCAs

Definition 1. (Anchor nodes) Given a set of queries Q that
have already been processed and a new query $q_{new}$, the generated
SLCA results $\Phi$ of Q can be taken as the anchors for efficiently
computing the SLCA results of $q_{new}$ by partitioning the keyword
nodes of $q_{new}$.




Figure 1: The usability of anchor nodes 

Example 2. Figure 1 shows the usability of anchor nodes for
computing SLCA results. Consider two SLCA results $X_1$ and $X_2$
(assume $X_1$ precedes $X_2$) for the current query set Q. For the
next query $q_{new} = \{s_1, s_2, s_3\}$ and its keyword instance
lists $L = \{l_{s_1}, l_{s_2}, l_{s_3}\}$, the keyword instances in
L will be divided into four areas by the anchor $X_1$:
(1) $L_{X_1\_anc}$, in which all the keyword nodes are ancestors of
$X_1$, so $L_{X_1\_anc}$ cannot generate new and distinct SLCA
results; (2) $L_{X_1\_pre}$, in which all the keyword nodes are
previous siblings of $X_1$, so we may generate new SLCA results if
the results are still bounded in the area; (3) $L_{X_1\_des}$, in
which all the keyword nodes are descendants of $X_1$, so it may
produce new SLCA results that will replace $X_1$; and
(4) $L_{X_1\_next}$, in which all the keyword nodes are next
siblings of $X_1$, so it may produce new results, but it may be
further divided by the anchor $X_2$. If the keyword node lists do
not all have instances in an area, then the nodes in this area can
be pruned directly, e.g., $l_{s_1}$ and $l_{s_2}$ can be pruned
without computation if $l_{s_3}$ is empty in $L_{X_1\_des}$.
Similarly, we can process $L_{X_2\_pre}$, $L_{X_2\_des}$ and
$L_{X_2\_next}$. After that, a set of new and distinct SLCA results
can be obtained with regard to the new query set $Q \cup q_{new}$.

Theorem 1. (Single Anchor) Given an anchor node $v_a$ and
a new query $q_{new} = \{s_1, s_2, \ldots, s_n\}$, its keyword node
lists $L = \{l_{s_1}, l_{s_2}, \ldots, l_{s_n}\}$ can be divided
into four areas anchored by $v_a$, i.e., the keyword nodes that are
the ancestors of $v_a$, denoted as $L_{v_a\_anc}$; the keyword nodes
that are the previous siblings of $v_a$, denoted as $L_{v_a\_pre}$;
the keyword nodes that are the descendants of $v_a$, denoted as
$L_{v_a\_des}$; and the keyword nodes that are the next siblings of
$v_a$, denoted as $L_{v_a\_next}$. We have that $L_{v_a\_anc}$ does
not generate any new result; each of the other three areas may
generate new and distinct SLCA results individually; and no new and
distinct SLCA results can be generated across the areas.

For the nodes in the ancestor area, all of them are ancestors
of the anchor node $v_a$. This means that the SLCA candidates they
can produce are also ancestors of $v_a$. Therefore, the area cannot
produce any new result due to the existence of $v_a$. For the nodes
in $L_{v_a\_pre}$ (or $L_{v_a\_next}$), if they can produce SLCA
candidates that are bounded in the left-bottom (or right-top) area
of $v_a$, then these candidates are new SLCA results. This is
because there is no ancestor-descendant relationship between the
nodes in the area and $v_a$. For the nodes in $L_{v_a\_des}$, if
they can produce SLCA candidates that are bounded in the
right-bottom area of $v_a$, then these candidates will be taken as
new and distinct results while $v_a$ will be removed from the result
set.

It is obvious that the nodes in the ancestor area do not take
part in the generation of SLCA results with the nodes in any other
area. Now let us show that keyword nodes drawn from two different
areas among the other three cannot jointly produce new and distinct
SLCA results. If we select some keyword instances from
$L_{v_a\_pre}$ and some keyword instances from $L_{v_a\_des}$, then
their corresponding SLCA candidate must be an ancestor of $v_a$,
which cannot become a new and distinct SLCA result due to the
existence of $v_a$ in the result set. Similarly, no result can be
generated if we select keyword instances from $L_{v_a\_pre}$ and
$L_{v_a\_next}$ (or from $L_{v_a\_des}$ and $L_{v_a\_next}$).
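
The four areas of Theorem 1 can be illustrated with a small sketch. It assumes nodes carry Dewey codes stored as tuples, so that tuple comparison coincides with document order, and it reads "previous/next siblings" as nodes that precede or follow the anchor in document order without being its ancestors or descendants. A keyword node equal to the anchor is grouped with the ancestors, since it cannot yield a result strictly below the anchor on its own. The names `classify` and `partition` are hypothetical; the paper obtains the areas through an R-tree index instead.

```python
from typing import Dict, List, Tuple

Dewey = Tuple[int, ...]  # assumed node identifier


def classify(v: Dewey, anchor: Dewey) -> str:
    """Return the area of keyword node v relative to the anchor."""
    if anchor[:len(v)] == v:          # v is an ancestor of the anchor (or the anchor itself)
        return "anc"
    if v[:len(anchor)] == anchor:     # v lies inside the anchor's subtree
        return "des"
    return "pre" if v < anchor else "next"


def partition(nodes: List[Dewey], anchor: Dewey) -> Dict[str, List[Dewey]]:
    """Split one keyword node list into the four areas around the anchor."""
    areas: Dict[str, List[Dewey]] = {"anc": [], "pre": [], "des": [], "next": []}
    for v in nodes:
        areas[classify(v, anchor)].append(v)
    return areas


areas = partition([(1,), (1, 1, 4), (1, 2, 5), (1, 3)], anchor=(1, 2))
```

Only the "pre", "des" and "next" lists need further processing; the "anc" list is discarded as stated in the theorem.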

Theorem 2. (Multiple Anchors) Given multiple anchor nodes
$V_a$ and a new query $q_{new} = \{s_1, s_2, \ldots, s_n\}$, its
keyword node lists $L = \{l_{s_1}, l_{s_2}, \ldots, l_{s_n}\}$ can
be maximally divided into $(3 * |V_a| + 1)$ areas anchored by the
nodes in $V_a$. Only the nodes in $(2 * |V_a| + 1)$ areas can
generate new SLCA candidates individually, and the nodes in the
$(2 * |V_a| + 1)$ areas are independent when computing SLCA
candidates.

Let $V_a = (x_1, \ldots, x_i, \ldots, x_j, \ldots, x_{|V_a|})$,
where $x_i$ precedes $x_j$ for $i < j$. We first partition the space
into four areas using $x_1$, i.e., $L_{x_1\_anc}$, $L_{x_1\_pre}$,
$L_{x_1\_des}$ and $L_{x_1\_next}$. For $L_{x_1\_next}$, we
partition it further using $x_2$ into four areas. We repeat this
process until we partition the area $L_{x_{|V_a|-1}\_next}$ into
four areas $L_{x_{|V_a|}\_anc}$, $L_{x_{|V_a|}\_pre}$,
$L_{x_{|V_a|}\_des}$ and $L_{x_{|V_a|}\_next}$ by using $x_{|V_a|}$.
Obviously, the total number of partitioned areas is
$3 * |V_a| + 1$. As the ancestor areas do not contribute to SLCA
results, we end up with $2 * |V_a| + 1$ areas that need to be
processed.

Now consider any two different areas $x_{i\_r_1}$ and $x_{j\_r_2}$,
where $r_1$ and $r_2$ could be either "pre" or "des" and $i \le j$.
If $j = |V_a|$, $r_2$ may also be "next". If $i = j$, then we know
$r_1 \ne r_2$. By Theorem 1, we know $x_{i\_r_1}$ and $x_{j\_r_2}$
cannot produce new and distinct SLCA results. If $i < j$, i.e.,
$x_i$ precedes $x_j$, then we know $x_{j\_r_2}$ is a sub-area of
$x_{i\_next}$. By Theorem 1 again, we get that $x_{i\_r_1}$ and
$x_{j\_r_2}$ cannot produce new and distinct SLCA results.
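
The repeated partitioning in the proof can be sketched as follows, again assuming Dewey codes stored as tuples and anchors that are SLCA results (so no anchor is an ancestor of another). The names `classify` and `partition_multi` and the area labels such as `x1_pre` are hypothetical.

```python
from typing import Dict, List, Tuple

Dewey = Tuple[int, ...]  # assumed node identifier


def classify(v: Dewey, anchor: Dewey) -> str:
    """Area of keyword node v relative to one anchor."""
    if anchor[:len(v)] == v:
        return "anc"
    if v[:len(anchor)] == anchor:
        return "des"
    return "pre" if v < anchor else "next"


def partition_multi(nodes: List[Dewey], anchors: List[Dewey]) -> Dict[str, List[Dewey]]:
    """Partition by x1, then split the remaining 'next' area by x2, and so on."""
    areas: Dict[str, List[Dewey]] = {}
    rest = list(nodes)
    for i, anchor in enumerate(sorted(anchors), start=1):
        parts: Dict[str, List[Dewey]] = {"anc": [], "pre": [], "des": [], "next": []}
        for v in rest:
            parts[classify(v, anchor)].append(v)
        for name in ("anc", "pre", "des"):
            areas[f"x{i}_{name}"] = parts[name]
        rest = parts["next"]          # only this area is split by later anchors
    areas[f"x{len(anchors)}_next"] = rest
    return areas


nodes = [(1,), (1, 1), (1, 2, 1), (1, 3), (1, 4, 2), (1, 5)]
areas = partition_multi(nodes, anchors=[(1, 2), (1, 4)])
effective = [name for name in areas if not name.endswith("_anc")]
```

For two anchors the sketch produces 3 * 2 + 1 = 7 areas, of which the 2 * 2 + 1 = 5 non-ancestor areas remain to be processed.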

Property 1. Consider the $(2 * |V_a| + 1)$ effective areas
anchored by the nodes in $V_a$. If $\exists s_i \in q_{new}$ such
that none of its instances appears in an area, then the area can be
pruned because it cannot generate SLCA candidates of $q_{new}$.

The reason is that, based on the SLCA semantics, any area that can
possibly generate new results should contain at least one matched
instance for each keyword in the query. Therefore, if an area
contains no instance of some query keyword, then the area can be
safely removed without any computation.
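
Property 1 amounts to a simple emptiness test per area. The sketch below assumes an area is represented as a mapping from each query keyword to the list of its instances inside that area; the function name `can_prune` and the sample Dewey codes are hypothetical.

```python
from typing import Dict, List, Tuple

Dewey = Tuple[int, ...]  # assumed node identifier


def can_prune(area: Dict[str, List[Dewey]]) -> bool:
    """An area is prunable as soon as one query keyword has no instance in it."""
    return any(not instances for instances in area.values())


# Descendant area of an anchor: s3 has no instance, so s1 and s2 need no merging.
des_area = {"s1": [(1, 2, 1)], "s2": [(1, 2, 2)], "s3": []}
pre_area = {"s1": [(1, 1, 1)], "s2": [(1, 1, 2)], "s3": [(1, 1, 3)]}
```

Here `can_prune(des_area)` holds while `can_prune(pre_area)` does not, so only the second area would be passed on to the SLCA computation.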

4.3.2 The Anchor-based Pruning Algorithm

Motivated by the properties of computing diversified SLCAs, we
design the anchor-based pruning algorithm. The basic idea is
described as follows. We generate the first new query and compute
its corresponding SLCA candidates as a starting point. When the next
new query is generated, we can use the intermediate results of the
previously generated queries to prune the unnecessary nodes
according to the above theorems and property. By doing this, we



Algorithm 2 Anchor-based Pruning Algorithm
input: a query q with n keywords and XML data T
output: Top-k query intentions Q and result set Φ

1: M_{m×n} = getFeatureTerms(q, T);
2: while (q_new = GenerateNewQuery(M_{m×n})) ≠ null do
3:     Line 3-Line 5 in Algorithm 1;
4:     if Φ is not empty then
5:         for all v_anchor ∈ Φ do
6:             get l_{ixjy}_pre, l_{ixjy}_des, and l_{ixjy}_next by calling
               Partition(l_{ixjy}, v_anchor);
7:             if ∀ l_{ixjy}_pre ≠ null then
8:                 φ' = ComputeSLCA({l_{ixjy}_pre}, v_anchor);
9:             if ∀ l_{ixjy}_des ≠ null then
10:                φ'' = ComputeSLCA({l_{ixjy}_des}, v_anchor);
11:            φ += φ' + φ'';
12:            if φ'' ≠ null then
13:                Φ.remove(v_anchor);
14:            if ∃ l_{ixjy}_next = null then
15:                Break the FOR-Loop;
16:            l_{ixjy} = l_{ixjy}_next for 1 ≤ ix ≤ m ∧ 1 ≤ jy ≤ n;
17:    else
18:        φ = ComputeSLCA({l_{ixjy}});
19:    score(q_new) = prob_q_new * |φ| * |φ| / (|Φ| + |φ|);
20:    Line 18-Line 23 in Algorithm 1;
21: return Q and result set Φ;



only generate the distinct SLCA candidates every time. That is 
to say, unlike the baseline algorithm, the diversified results can be 
computed directly without further comparison. 

The detailed procedure is shown in Algorithm 2. Similar to
the baseline algorithm, we need to construct the matrix of feature
terms and retrieve their corresponding node lists, which can be
maintained using an R-tree index. Then, we calculate the likelihood
of generating the observed query q when the issued query is
$q_{new}$. Different from the baseline algorithm, we utilize
the intermediate SLCA results of previously generated queries as
the anchors to efficiently compute the new SLCA results for the
following queries. For the first generated query, we compute
the SLCA results using any existing XML keyword search method,
as the baseline algorithm does (Line 18). Here, we use a
stack-based method to implement the function ComputeSLCA().

The results of the first query will be taken as anchors to prune
the node lists of the next query, reducing its evaluation cost, in
Lines 5-16. Given an anchor node $v_{anchor}$, for each node list
$l_{i_x j_y}$ of a query keyword in the current new query, we may
get three effective node lists $l_{i_x j_y\_pre}$,
$l_{i_x j_y\_des}$ and $l_{i_x j_y\_next}$ using the R-tree index by
calling the function Partition(). If a node list is empty, e.g.,
$l_{i_x j_y\_pre}$ = null, then we do not need to get the node lists
for the other query keywords in the same area, e.g., in the
left-bottom area of $v_{anchor}$. This is because the area cannot
produce any SLCA candidates at all. Consequently, the nodes in this
area cannot generate new and distinct SLCA results. If all the node
lists have at least one node in the same area, then we compute the
SLCA results by the function ComputeSLCA(), e.g.,
ComputeSLCA({$l_{i_x j_y\_des}$}, $v_{anchor}$), which merges the
nodes in {$l_{i_x j_y\_des}$}. If the SLCA results are descendants
of $v_{anchor}$, then they will be recorded as new distinct results
and $v_{anchor}$ will be removed from the temporary result set.
Through Line 19, we obtain the final score of the current query
without comparing the SLCA results of $\phi$ with those of $\Phi$.
Finally, we record the score and the results of the new query into Q
and $\Phi$, respectively. After all the necessary queries are
computed, the top-k diversified queries and their results are
returned.
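
The anchor loop described above can be sketched end to end under several stated assumptions. Nodes are Dewey codes stored as tuples; a brute-force `compute_slca` stands in for the stack-based method; plain list filtering stands in for the R-tree based Partition(); and the final "next" area is evaluated after the loop, which the pseudocode leaves implicit but Theorem 2 counts among the effective areas. The "bounded in the area" condition is modelled by discarding candidates that are ancestors of (or equal to) an anchor. All function names are hypothetical.

```python
from itertools import product
from typing import List, Sequence, Set, Tuple

Dewey = Tuple[int, ...]


def anc_or_self(a: Dewey, b: Dewey) -> bool:
    return b[:len(a)] == a


def lca(nodes: Sequence[Dewey]) -> Dewey:
    prefix = []
    for parts in zip(*nodes):
        if len(set(parts)) != 1:
            break
        prefix.append(parts[0])
    return tuple(prefix)


def compute_slca(lists: List[List[Dewey]]) -> Set[Dewey]:
    """Brute force: LCAs of all combinations that have no descendant LCA."""
    if any(not l for l in lists):
        return set()
    cands = {lca(c) for c in product(*lists)}
    return {c for c in cands
            if not any(d != c and anc_or_self(c, d) for d in cands)}


def area_of(v: Dewey, anchor: Dewey) -> str:
    if anc_or_self(v, anchor):
        return "anc"
    if anc_or_self(anchor, v):
        return "des"
    return "pre" if v < anchor else "next"


def anchored_slca(lists: List[List[Dewey]], anchors: Set[Dewey]):
    """Return (new distinct results, anchors that survive)."""
    def bounded(results: Set[Dewey]) -> Set[Dewey]:
        return {r for r in results
                if not any(anc_or_self(r, a) for a in anchors)}

    new: Set[Dewey] = set()
    kept = set(anchors)
    rest = lists
    for anchor in sorted(anchors):
        split = [{name: [v for v in l if area_of(v, anchor) == name]
                  for name in ("pre", "des", "next")} for l in rest]
        new |= bounded(compute_slca([s["pre"] for s in split]))
        below = bounded(compute_slca([s["des"] for s in split]))
        if below:                      # more specific results replace the anchor
            new |= below
            kept.discard(anchor)
        rest = [s["next"] for s in split]
        if any(not l for l in rest):   # some keyword has no instance further on
            return new, kept
    new |= bounded(compute_slca(rest))
    return new, kept


lists = [[(1, 1, 1), (1, 2, 1, 1), (1, 3), (1, 5, 1)],
         [(1, 1, 2), (1, 2, 1, 2), (1, 4, 1), (1, 5, 2)]]
new, kept = anchored_slca(lists, anchors={(1, 2), (1, 4)})
```

In this toy input the anchor `(1, 2)` is replaced by its descendant result `(1, 2, 1)`, while the areas around `(1, 4)` are skipped because one keyword has no instance there.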

4.4 Anchor-based Parallel Sharing Solution 

Although the anchor-based pruning algorithm can avoid unnec- 
essary computation cost of the baseline algorithm, it can be further 
improved by exploiting the parallelism of keyword search diversi- 
fication and reducing the repeated scanning of the same node lists. 

4.4.1 Observations 

According to the semantics of keyword search diversification, 
only the distinct SLCA results need to be returned to users. We 
have the following two observations. 

Observation 1: Anchored by the SLCA result set $V_a$ of previously
processed queries Q, the keyword nodes of the next query $q_{new}$
can be classified into $2 * |V_a| + 1$ areas. According to
Theorem 2, no new and distinct SLCA results can be generated across
areas. Therefore, the $2 * |V_a| + 1$ areas of nodes can be
processed independently, i.e., we can compute the SLCA results area
by area. This makes parallel keyword search diversification
efficient, without communication cost among processors.
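
Because the areas are independent, each one can be handed to a separate worker and the partial results simply unioned. The sketch below uses a thread pool as a stand-in for separate processors, a brute-force `compute_slca` over Dewey-code tuples, and areas that are assumed to be already partitioned and bounded; the names `parallel_slca` and the area labels are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import product
from typing import Dict, List, Sequence, Set, Tuple

Dewey = Tuple[int, ...]


def lca(nodes: Sequence[Dewey]) -> Dewey:
    prefix = []
    for parts in zip(*nodes):
        if len(set(parts)) != 1:
            break
        prefix.append(parts[0])
    return tuple(prefix)


def compute_slca(lists: List[List[Dewey]]) -> Set[Dewey]:
    """Brute-force SLCA of the keyword node lists of one area."""
    if any(not l for l in lists):
        return set()
    cands = {lca(c) for c in product(*lists)}
    return {c for c in cands
            if not any(d != c and d[:len(c)] == c for d in cands)}


def parallel_slca(areas: Dict[str, List[List[Dewey]]], workers: int = 6) -> Set[Dewey]:
    """Evaluate every area independently and union the per-area results."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partial = list(pool.map(compute_slca, areas.values()))
    results: Set[Dewey] = set()
    for part in partial:
        results |= part
    return results


areas = {
    "x1_pre": [[(1, 1, 1)], [(1, 1, 2)]],
    "x1_des": [[(1, 2, 1, 1)], [(1, 2, 1, 2)]],
    "x2_next": [[(1, 5, 1)], [(1, 5, 2)]],
}
results = parallel_slca(areas)
```

No worker reads or writes the data of another area, which mirrors the absence of communication cost noted in the observation.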

Observation 2: Because there are term overlaps between the 
generated queries, the intermediate partial results of the previously 
processed queries may be reused for evaluating the following queries, 
by which the repeated computations can be avoided. 

4.4.2 Anchor-based Parallel Sharing Algorithm

To make parallel computing efficient, we utilize the SLCA
results of previous queries as the anchors to partition the node
lists that need to be computed. By assigning areas to processors, no
communication cost among the processors is needed. Our proposed
algorithm guarantees that the results generated by each processor
are SLCA results of the current query. In addition, we also
take into account the shared partial matches among the generated
queries, by which we can further improve the efficiency of the
algorithm.

Different from the above two algorithms, we first generate and
analyse all the possible queries $Q_{new}$. Here, we use a
vector V to maintain the shared query segments among the
generated queries in $Q_{new}$. Then, we process the first
query as in the above two algorithms. When the next query arrives,
we check its shared query segments and exploit parallel
computing to evaluate the query.

To do this, we first check whether the new query $q_{new}$ contains
shared parts $\psi$ in V. For each shared part in $\psi$, we need to
check its processing status. If the status has already been set to
"processed", then the partial matches of the shared part have been
computed before. In this case, we only need to retrieve the
partial matches from the previously cached results. Otherwise, the
partial matches of the shared part have not been computed before,
and we have to compute them from the original node lists of the
shared segments. After that, the processing status is set to
"processed". Then, the node lists of the remaining query segments
are processed. In this algorithm, we also exploit parallel
computing to improve the performance of query evaluation. At
the beginning, we specify the maximal number of processors and
the current processor's id (denoted as PID). Then, in each round,
we distribute the nodes that need to be computed to the processor
with the current PID. Once all the required nodes have arrived at a
processor, the processor is activated to compute the SLCA results
of $q_{new}$ or the partial matches for the shared segments in
$q_{new}$. After all active processors complete their SLCA
computations, we get the final SLCA results of the query $q_{new}$.
Finally, we calculate the score of the query $q_{new}$ and compare
its score with the previous ones. If its score is larger than that
of one of the queries in Q, then the query $q_{new}$, its score
score($q_{new}$) and its SLCA results are recorded. After all the
necessary queries are processed, we return the top-k qualified and
diversified queries and their corresponding SLCA results. The
detailed algorithm is not provided due to the limited space.
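
Since the detailed algorithm is omitted, the following is only a hedged sketch of the sharing idea: partial matches of a shared query segment are computed once, marked as processed, and then served from a cache. The class name `SharedSegmentCache`, its methods, and the placeholder `compute` callable are all hypothetical and do not come from the paper.

```python
from typing import Callable, Dict, FrozenSet, Iterable, List


class SharedSegmentCache:
    """Caches the partial matches of query segments shared by generated queries."""

    def __init__(self, compute: Callable[[FrozenSet[str]], List[str]]) -> None:
        self._compute = compute
        self._results: Dict[FrozenSet[str], List[str]] = {}
        self.computations = 0  # how often the expensive step actually ran

    def is_processed(self, segment: Iterable[str]) -> bool:
        return frozenset(segment) in self._results

    def partial_matches(self, segment: Iterable[str]) -> List[str]:
        key = frozenset(segment)
        if key not in self._results:      # status is not yet "processed"
            self._results[key] = self._compute(key)
            self.computations += 1
        return self._results[key]


# Placeholder computation; a real one would merge the segment's node lists.
cache = SharedSegmentCache(lambda segment: sorted(segment))
first = cache.partial_matches(["query", "optimization"])
second = cache.partial_matches(["optimization", "query"])
```

The second lookup hits the cache, so two generated queries that share the segment trigger only one computation.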

5. EXPERIMENTS 

In this section, we show extensive experimental results for
evaluating the performance of our baseline algorithm (denoted as
baseline evaluation, BE) and anchor-based algorithm (denoted as
anchor-based evaluation, AE), which were implemented in Java and
run on a 3.0GHz Intel Pentium 4 machine with 2GB RAM running
Windows XP. Our anchor-based parallel sharing algorithm (denoted
as ASPE) was implemented on six computers, which serve as six
processors for parallel computation.

5.1 Dataset and Queries 

We use a real dataset, DBLP [20], and a synthetic XML benchmark
dataset, XMark [21], for testing the proposed XML keyword
search diversification model and our designed algorithms. The size
of the DBLP dataset is 971MB and the size of the generated XMark
dataset is 697MB. Compared with the DBLP dataset, the synthetic
XMark dataset has varied depths and complex data structures, but it
does not contain clear semantic information because its data is
synthetic. Therefore, we only use the DBLP dataset to measure the
effectiveness of our diversification model in this work.

For each XML dataset used, we selected some terms based on the
following criteria: (1) a selected term should often appear in
user-typed keyword queries; (2) a selected term should highlight
different semantics when it co-occurs with feature terms in
different contexts. Based on these criteria, we chose the following
terms for each dataset. For the DBLP dataset, we selected "database,
query, programming, semantic, structure, network, domain, dynamic,
parallel". For the XMark dataset, we chose "brother, gentleman,
look, free, king, gender, iron, purpose, honest, sense, metals,
petty, Shakespeare, weak, opposite". By randomly combining the
selected terms, we generated a set of keyword queries for
each dataset. Here we limited the length of each keyword query to
two terms because short queries often contain more ambiguity
and can be diversified into more search intentions.

5.2 Efficiency of Diversification Algorithms 

We show the efficiency of our proposed diversification algorithms
by selecting 12 keyword queries for each dataset due to the limited
space. Figure 2 shows their response time when we perform keyword
search diversification over the DBLP dataset. According to the
experimental results, BE takes about 2.7 seconds to answer an
original query on average, whereas AE and ASPE finish in
1.7 seconds and 1.1 seconds on average, respectively. Compared
with BE, AE reduces the response time by about 37% on average
and ASPE reduces the response time by 59.2% on average.
This is because lots of nodes can be pruned without computation.
For example, Figure 2(g) shows the response time of the three
algorithms for the query {parallel, database}, from which we can see
that BE takes about 3.5 seconds, AE takes about 2.4 seconds, and
ASPE takes about 1.1 seconds. AE outperforms BE because BE
needs to process 233,566 nodes while AE only needs to process
177,069 nodes by pruning 56,497 nodes. ASPE can further improve
the efficiency of keyword search diversification because the
177,069 nodes can be processed in parallel. Let us consider another
query {domain, query} in Figure 2(c), where BE and AE consume
similar response times. From the experimental results, we can
find that BE processes 114,418 nodes while AE still needs to process
110,687 nodes. Here, only 3,731 nodes can be pruned.
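
The reported reductions follow from the average response times and node counts quoted above; the short check below only re-derives them. The helper name `reduction` and the variable names are hypothetical.

```python
def reduction(baseline: float, improved: float) -> float:
    """Relative saving of `improved` with respect to `baseline`."""
    return (baseline - improved) / baseline


ae_gain = reduction(2.7, 1.7)       # about 0.37, i.e. roughly 37%
aspe_gain = reduction(2.7, 1.1)     # about 0.59, i.e. roughly 59%

pruned_parallel_database = 233_566 - 177_069   # 56,497 nodes pruned
pruned_domain_query = 114_418 - 110_687        # 3,731 nodes pruned

share_parallel_database = pruned_parallel_database / 233_566   # about 0.24
share_domain_query = pruned_domain_query / 114_418             # about 0.03
```

The contrast between roughly 24% and roughly 3% of nodes pruned is consistent with AE helping on {parallel, database} but barely on {domain, query}.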

Figure 3 shows the experimental results when we perform keyword
search diversification over the XMark dataset. Although BE is
slower than AE and ASPE, it can still finish each query
evaluation within 0.7 seconds. Compared with BE, the improvement
of AE is not significant because (1) the number of keyword nodes is
not as large as that of DBLP, e.g., most keyword queries can be
processed by exploring about 40,000 nodes in the XMark dataset while
about 230,000 nodes have to be explored to answer most queries in
the DBLP dataset; (2) the keyword nodes are distributed evenly in
the synthetic XMark dataset, which limits the performance of AE and
ASPE, e.g., for {brother, gentleman}, BE needs to process 47,000
nodes while AE still needs to deal with 44,541 nodes.

From Figure 2 and Figure 3, we can find that the increasing
number of search intentions affects the response time of BE,
AE and ASPE only a little. This is because for each original keyword
query, we first select 20 feature terms and then generate 400
possible search intentions. After that, the generated search
intentions have to be evaluated in the descending order of their
mutual information scores.

Figure 4: Average time comparison of original queries and
their search intentions

To compare the experimental performance fairly, we implement
the function ComputeSLCA() in BE, AE and ASPE by using the
simple approach DIL in XRank [2]. Although other efficient
SLCA computation algorithms can be applied to our work, they
require some special processing, which is beyond the scope of
this work. Figure 4 shows the average response time of
evaluating the 14 original queries by using XRank [2], and of
directly computing their diversifications by using our proposed BE,
AE and ASPE. According to the experimental results, if we
first compute the keyword search results of the original keyword
query and then select the diversified results, it takes at least 10
times the time cost of our proposed diversification approaches.
In particular, when the users are only interested in a few contexts
that are distributed in the data to be searched, our context-based
keyword query diversification algorithms can greatly outperform the
post-processing diversification approaches.

5.3 Effectiveness of Diversification Model 

Table 4 shows the top 5 search intentions when we evaluate the
14 selected keyword queries over the DBLP dataset. We also show,
for each query, the percentage of the results of the original query
that are recalled by its recommended top 5 search intentions.
Although the generated query of each search intention may contain
more keywords than its original query, the results of the generated
query and its corresponding results of the original query map to
the same SLCA nodes in most cases. This is because the DBLP
XML tree is not deep. Therefore, it is fair to compare the SLCA
results of original queries with those of their generated queries
based on search intentions over the DBLP dataset. Here, we use the
average number of results and the average selective rates for the
analysis. For example, {query, database} can produce more than 2000
results but {dynamic, database} can only generate 225 results in
total. From the experimental results, we can see that the selective
rates of the top 5 search intentions are no less than 10% for the
first 6 queries, in the range of 5%-10% for the next 2 queries, and
lower than 5% for the remaining 6 queries. Based on our
diversification model, for each of the first 6 queries, we can
return 22 of 171 distinct results to match the top 5 search
intentions on average. Similarly, we can return about 17 of 250
novel results for each of the next 2 queries and about 15 of 480
distinct results for each of the last 6 queries on average. From the
analysis, we can find that our diversification model is not biased
toward the quantity of results; it has relatively fixed selective
rates, i.e., the percentages of the original query results. Based on
the selective rates, it becomes more practical and easier for users
to go through the small number of returned distinct results, rather
than be overwhelmed by the large number of all results.

Figure 2: 12 DBLP queries for k=5, 10, 15, 20

Figure 3: 12 XMark queries for k=5, 10, 15, 20

Table 4: Selected top-5 search intentions over DBLP dataset

original query | top 5 search intentions | % of original query results
{semantic query} | {semantic query reformulation}; {semantic query languages}; {semantic query optimization}; {semantic query processing}; {semantic models relational database} | 16.3%
{structure query} | {structure query performance}; {spatial structure query efficient}; {structure query language}; {structure query reformulation}; {structure query evaluation} | 15.6%
{domain query} | {domain query language}; {query performance specific domain}; {specific domain query languages}; {query expansion domain identification}; {query processing domain database} | 13%
{dynamic query} | {dynamic query languages}; {dynamic query approach}; {query processing dynamic database}; {dynamic query efficient}; {dynamic query evaluation} | 12.1%
*{parallel query} | {parallel query optimization}; {parallel query language}; {ranking query parallel algorithm}; {query optimization parallel database}; {parallel query performance} | 11.4%
{domain database} | {image database specific domain}; {video database specific domain}; {relational database multi domain}; {protein database domain}; {object database domain} | 10%
{database, network} | {protein database network}; {mobile database wireless network}; {database management sensor network}; {distributed database network}; {relational database network} | 6.5%
{semantic database} | {semantic query database approach}; {semantic relationships object database}; {learning semantic object database}; {semantic data distributed database} | 6.3%
{parallel database} | {parallel processing protein database}; {parallel query object database}; {parallel scheduling database query}; {parallel performance database applications}; {parallel ODMG object database} | 4.2%
{dynamic database} | {dynamic sequence database}; {dynamic relational database}; {dynamic location database management}; {dynamic allocation distributed database}; {dynamic database approach} | [illegible]
{structure network} | {protein structure network}; {structure discovery network data}; {structure feature network data}; {structure content neural network}; {structure network database} | 3.3%
*{query database} | {query language database applications}; {query processing relational database}; {query engine relational database}; {query optimization object database}; {query efficient relational database} | 2.5%
*{programming database} | {logic programming database approach}; {logic programming database applications}; {dynamic programming object database}; {linear programming relational database}; {dynamic programming relational database} | 1.9%
{parallel network} | {parallel performance large network}; {parallel database large network}; {parallel performance peer network}; {parallel processing neural network}; {parallel scheduling network} | 1.7%

Besides the above high-level analysis, let us now look at
three cases in more detail. For {parallel query}, the queries of the
top 5 selected search intentions can return the publications that
discuss the research problems in different contexts: (1) parallel
query optimization, which highlights the query optimization problem
in general parallel systems or algorithms; (2) parallel query
language, which highlights the query language models in general
parallel systems or algorithms; (3) ranking query parallel
algorithm, which highlights the ranking query in parallel
algorithms; (4) query optimization parallel database, which focuses
on the query optimization problem on parallel databases;
(5) parallel query performance, which focuses on the query
performance analysis in parallel systems or platforms.
Although the context of (4) overlaps with that of (1), the search
intention in (4) is clearer than that in (1) because (4) specifies
the database area. For the given query {programming database}, the
search intentions can be identified by highlighting the context of
programming, e.g., logic, linear and dynamic, and the context of
database, e.g., relational and object. For the given query
{query database}, the search intentions are highlighted in the
following contexts: query language in database applications; query
system engine over relational databases; query processing and
efficiency over relational databases; and query optimization
techniques in object databases.

6. RELATED WORK 

Recently, diversifying the results of document retrieval has been
introduced [8], [5], [9], [10]. Most of the techniques perform
diversification as a post-processing or re-ranking step of document
retrieval. These techniques first retrieve relevant results and then
filter or re-order the result list to achieve diversification. [8]
is a classic example of such a strategy, which can be employed to
re-rank documents and promote diversity based on maximal marginal
relevance (MMR). In [5], the search results are diversified through
categorization according to the existing taxonomy of information, in
which user intents are modelled at the topical level of the
taxonomy. In [9], Chen and Karger use Bayesian retrieval models and
condition the selection of subsequent documents by making
assumptions about the relevance of the previously retrieved
documents. While their approach is capable of selecting anywhere
between $1 \le k \le n$ relevant documents, they focus primarily on
optimizing the single document (k=1) and perfect precision (k=n)
scenarios. Their model does not explicitly consider user intent or
document categorizations. In [10], the authors also consider
meaningful ways to evaluate the performance of search
diversification and subtopic retrieval algorithms: classic ranked
retrieval metrics, such as normalized discounted cumulative gain
(NDCG), mean average precision (MAP), and mean reciprocal rank
(MRR), are discussed by taking user intent into account. However,
the above work is predicated implicitly on the assumption that a
single relevant document will fulfill a user's information need,
making these approaches inadequate for many informational queries.
To fill this research gap, the authors in [22] present a search
diversification algorithm particularly suitable for
informational queries by explicitly modeling that the user may need
more than one page to satisfy her/his need. However, the majority
of the above work focuses on the effectiveness of diversifying
search results. To improve the efficiency of document retrieval with
the consideration of diversification, the authors in [11] propose an
efficient algorithm for diversity-aware search based on a
low-overhead data access prioritization scheme with theoretical
quality guarantees.
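
For readers unfamiliar with the re-ranking style of diversification mentioned above, the following is a generic sketch of greedy MMR selection, not code from any of the cited works. It assumes precomputed relevance scores and a symmetric similarity function; the names `mmr`, `relevance`, `sim` and the trade-off parameter `lam` are illustrative.

```python
from typing import Callable, Dict, List


def mmr(relevance: Dict[str, float],
        sim: Callable[[str, str], float],
        k: int,
        lam: float = 0.5) -> List[str]:
    """Greedily pick k documents, trading relevance against redundancy."""
    selected: List[str] = []
    remaining = set(relevance)
    while remaining and len(selected) < k:
        def score(d: str) -> float:
            # Redundancy is the highest similarity to anything already chosen.
            redundancy = max((sim(d, s) for s in selected), default=0.0)
            return lam * relevance[d] - (1 - lam) * redundancy
        best = max(sorted(remaining), key=score)
        selected.append(best)
        remaining.remove(best)
    return selected


relevance = {"a": 0.9, "b": 0.85, "c": 0.5}
similar = {frozenset({"a", "b"}): 0.95}   # a and b are near-duplicates


def sim(x: str, y: str) -> float:
    return similar.get(frozenset({x, y}), 0.1)


ranking = mmr(relevance, sim, k=2)
```

With these toy scores the second pick is `c` rather than the more relevant but redundant `b`, which is the post-processing behaviour that the context-based approach of this paper avoids having to apply after retrieval.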

Another conventional approach to achieve diversification is
clustering or classification of search results. By grouping search
results based on similarity, users can navigate to the right groups
to retrieve the desired information. Clustering and classification
have been applied to document retrieval [5], image retrieval [23],
and database query results [12], [14], [24]. Similar to result
re-ranking, clustering is usually performed as a post-processing
step, and is computationally expensive. In addition, the common
approach of clustering the most relevant documents, and presenting
at most one document per cluster, corresponds to a specific and very
limited form of diversification semantics.

Apart from the above approaches, two other works address the
diversification of search results for document retrieval
at the experimental level. [25] formally explores the axioms that
any diversification system should be expected to satisfy, which can
be taken as a basis of comparison between different objective
functions for diversifying search results. [26] focuses on a
theoretical development of the economic portfolio theory for
document ranking. Their model considers a "risk" tradeoff between
the expected relevance of a set of documents and the correlation
between them, modeled as the mean and variance. By balancing the
overall relevance of the list against its risk (variance), the top-n
documents will be selected and ordered.

7. CONCLUSIONS 

In this paper, we first presented an approach to searching for
diversified results of keyword queries over XML data based on the
contexts of the query keywords in the data. The diversification of
the contexts was measured by exploring their relevance to the
original query and the novelty of their results. Furthermore, we
designed three efficient algorithms based on the observed properties
of XML keyword search results. Finally, we demonstrated the
efficiency of our proposed algorithms by running a substantial
number of queries over both the DBLP and XMark datasets. Meanwhile,
we also verified the effectiveness of our diversification model by
analyzing the returned search intentions for the given keyword
queries over the DBLP dataset. From the experimental results, we
conclude that our proposed diversification algorithms can return
qualified search intentions and results to users in a short time.